A Shared Task for a Shared Goal: Systematic Annotation of Literary Texts

Authors

  • Nils Reiter
  • Evelyn Gius
  • Jannik Strötgen
  • Marcus Willand
Abstract

In this talk, we would like to outline a proposal for a shared task (ST) in and for the digital humanities. In general, shared tasks are highly productive frameworks for bringing together different researchers and research groups and, if done sensibly, they foster interdisciplinary collaboration. They have a tradition in natural language processing (NLP), where organizers define research tasks and settings.

To accommodate the particularities of DH research, we propose a shared task that proceeds in two phases, with two distinct target audiences and sets of possible participants. This setup allows both "sides" of the DH community to contribute what they do best: humanities scholars focus on conceptual issues, their description and definition, while computer science researchers focus on technical issues and work towards automatisation (cf. Kuhn & Reiter, 2015). The ideal scenario, in which both "sides" of DH contribute to the work in both areas, is challenging to achieve in practice. The shared task scenario takes this into account and encourages humanities scholars without access to programming "resources" to contribute to the conceptual phase (Phase 1), while software engineers without an interest in literature per se can contribute to the automatisation phase (Phase 2). We believe that this setup can actually lower the entry bar for DH research. Decoupling, however, does not imply strict, uncrossable boundaries: there needs to be interaction between the two phases, which is supported by our mixed organisation team. In particular, this setup allows mixed teams to participate in both phases (and it will be interesting to see how they fare).

In Phase 1 of the shared task, participants with a strong understanding of a specific literary phenomenon (literary studies scholars) work on the creation of annotation guidelines. This allows them to bring in their expertise without worrying about the feasibility of automatisation endeavours or struggling with technical issues. We will compare the different annotation guidelines both qualitatively, through in-depth discussion during a workshop, and quantitatively, by measuring inter-annotator agreement. This will result in a community-guided selection of annotation guidelines for a set of phenomena. The involvement of the research community in this process guarantees that heterogeneous points of view are taken into account.

The selected guidelines then enter Phase 2, in which annotations are produced on a moderately large scale. These annotations feed into a "classical" shared task as it is established in the NLP community: various teams competitively contribute systems whose performance is evaluated quantitatively. Given the complexity of many phenomena in literature, we expect the automatisation of such annotations to be an interesting challenge from an engineering perspective. At the same time, it is an excellent opportunity to initiate the development of tools tailored to the detection of specific phenomena that are relevant for computational literary studies.
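As an illustration of the quantitative comparison envisaged for Phase 1, the sketch below computes Cohen's kappa between two annotators who have applied the same guideline to the same text segments. The per-sentence labelling, the category names and the reduction of annotations to one label per segment are hypothetical simplifications for this example, not part of the proposal itself.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same segments."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: proportion of segments with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical per-sentence labels for a narrative-level guideline.
annotator_1 = ["frame", "frame", "embedded", "embedded", "frame"]
annotator_2 = ["frame", "embedded", "embedded", "embedded", "frame"]
print(f"kappa = {cohen_kappa(annotator_1, annotator_2):.2f}")  # kappa = 0.62
```

Guidelines whose annotations yield higher agreement would, other things being equal, be stronger candidates in the community-guided selection.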
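For the quantitative evaluation in Phase 2, submitted systems could be scored against the gold annotations with measures such as precision, recall and F1. The sketch below assumes an exact-match comparison of (start, end, label) spans; both the span representation and the exact-match criterion are assumptions for this example, and other matching schemes (e.g. partial overlap) are equally conceivable.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall and F1 over sets of annotated spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical character spans: (start offset, end offset, phenomenon label).
gold_spans = {(0, 12, "metaphor"), (40, 55, "metaphor"), (80, 95, "irony")}
system_spans = {(0, 12, "metaphor"), (40, 58, "metaphor")}
print(span_prf(gold_spans, system_spans))  # precision 0.50, recall 0.33, F1 0.40
```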


Similar articles


The CoNLL-2010 Shared Task: Learning to Detect Hedges and their Scope in Natural Language Text

The CoNLL 2010 Shared Task was dedicated to the detection of uncertainty cues and their linguistic scope in natural language texts. The motivation behind this task was that distinguishing factual and uncertain information in texts is of essential importance in information extraction. This paper provides a general overview of the shared task, including the annotation protocols of the training an...


Automatic Analysis and Annotation of Literary Texts

In this work a machine learning oriented perspective on computer aided support to literary analysis is presented. A representation of narrative phenomena is proposed and an automatic annotation model for such phenomena is trained on texts provided by a critic. As a short-term research task, we studied how the observable textual piece of evidence impact on the learning agent capabilities, over a...


MEANTIME, the NewsReader Multilingual Event and Time Corpus

In this paper, we present the NewsReader MEANTIME corpus, a semantically annotated corpus of Wikinews articles. The corpus consists of 480 news articles, i.e. 120 English news articles and their translations in Spanish, Italian, and Dutch. MEANTIME contains annotations at different levels. The document-level annotation includes markables (e.g. entity mentions, event mentions, time expressions, ...


Extracting Verbal Multiword Data from Rich Treebank Annotation

The PARSEME Shared Task on automatic identification of verbal multiword expressions aims at identifying such expressions in running texts. Typology of verbal multiword expressions, very detailed annotation guidelines and gold-standard data for as many languages as possible will be provided. Since the Prague Dependency Treebank includes Czech multiword expression annotation, it was natural to ma...




Publication date: 2017